This is the 3rd of the blog posts series that talks about ARM64 performance investigation for .NET 5. You can read previous blogs at:
In this post, I will describe an important compiler phase “peephole optimization” and how much improvement it can bring to the generated machine code using examples of ARM64 code generated by .NET runtime.
Introduction
Peephole optimization is a classic compiler optimization done at the very end of code generation. The concept is very simple, imagine seeing a small set of generated instructions through a peephole and exploring possibilities of changing them with better and performant instructions. An example would be easy to demonstrate.
Below is the generated code for C#’s HashSet.Contains()
method:
...
mov w22, #0
ldr x23, [x19,#16]
ldr x24, [x19,#24]
cbnz x24, G_M49529_IG15
cbz x20, G_M49529_IG04
mov x0, x20
...
Here, there are two ldr
instructions that loads 64-bit values from memory [x19 + 16]
and [x19 + 24]
into x23
and x24
registers respectively. Thus after execution, x23
will contain 8-bytes from memory location [x19+16]
thru [x19+23]
and x24
will contain 8-bytes from memory location [x19+24]
thru [x19+31]
. If you refer the ARM software optimization guide, the latency for ldr
instruction is 4 cycles. So two instructions takes 8 cycles to complete the loading of 128-bit data from memory into registers. Is it possible to replace these two instructions with a cheaper instruction? ldp
instruction on the other hand loads pair of 64-bit registers from memory. The latency of ldp
is 4 cycles which is same as single ldr
instruction. So above two ldr
instructions can be replaced with a single ldp
instruction to get the same values in registers x23
and x24
.
1
2
3
4
5
6
7
...
52800016 mov w22, #0
F9400A77 ldp x23, x24, [x19,#16] # <-- replacement
B5000BF8 cbnz x24, G_M49529_IG15
B4000174 cbz x20, G_M49529_IG04
AA1403E0 mov x0, x20
...
In this example, the compiler, while scanning the generated code, can notice that there are two ldr
instructions back to back such that they load the data from subsequent memory location and replace it with cheaper (in this case fewer cycles) ldp
instruction. The optimizer should be cautious to not perform any optimization if the two ldr
instructions do not load values from consecutive memory location.
Such optimizations are known as peephole optimizations. Such replacements look simple and often seems to be of little value in desktop application but its impact can be dramatic in cloud apps. Imagine the code produced for your app has few hundred such pair of ldr
and some of it gets called in a loop or hot function million times, by doing such optimization, you have saved 400 billion CPU cycles (100 * 1,000,000 * 4) during the entire lifetime of the app.
.NET and peephole
As I discussed in strategy of performance investigation previously, inspecting the generated code was the key to improve performance of ARM64. I started with saving ARM64 generated code of all the .NET Core libraries in a file. This was a gigantic file of 1 GB. I wrote small tool to parse this file to fetch the ARM64 code for a given .NET Core library method that I am interested in. With this tool, I started studying ARM64 code of various methods and that is when I noticed redundant instructions that could have been optimized using peephole optimization. RyuJIT, unfortunately at the time of writing this blog, did not have peephole optimization phase. It would have taken man-months of work to write one in RyuJIT. So, an interesting question that I was facing was, how should I find out the benefit of a peephole optimization to the framework library without actually implementing it in RyuJIT? I needed this data to prioritize various peephole optimizations that can be done and will it be even worth to consider writing such optimizer in RyuJIT.
AnalyzeAsm
Gladly, I figured out an easy way to get this data. Since I already had a single file containing ARM64 generated code for framework libraries, all I had to write were various scanners that would scan this file and find instruction patterns for which peephole optimization can be done. I wrote AnalyzeAsm, a C# utility that would do this job and print out library methods and portion of instructions that can be optimized. It proved as the best utility I invested my time in, because it helped me find at least 11 peephole optimization opportunities so far.
Peephole opportunities
Let us walk through some of the opportunities that AnalyzeAsm tool helped to discover.
1. Optimize “ldr+ldr” to “ldp”
This is the same example that I have given above. Optimize pair of ldr
instructions with a single ldp
instruction. If you refer this and this issue, you can see the output of AnalyzeAsm
that tells portion of instructions along with framework library method names in which they are present. For this optimization, it found approxiametely 34,000 such pairs in 16,000 methods. That is a huge number and is definitely an important optimization worth implementing.
2. Optimize “str+str” to “stp”
This optimization opportunity is like the one above, except that instead of loading the values from memory, the instruction str
stores value into memory. stp
can perform the same operation on pair of registers if storing their values in subsequent memory. More details along with number of occurrences of pair of str
can be seen in this and this.
The following code
1
2
str x13, [x12]
str x14, [x12,#8]
can be optimized to
1
stp x13, x14, [x12]
3. Optimize “str wzr + str wzr” with “str xzr”
wzr
is 4-byte zero register while xzr
is an 8-byte zero register in ARM64. If 4-byte zero value is being stored in memory location followed by another 4-byte of zero value in subsequent 4-bytes location, then the pair of such instruction can be replaced with a single store of 8-byte zero value. Details in this issue.
The following code
1
2
str wzr, [x2, #8]
str wzr, [x2, #12]
can be optimized to
1
str xzr, [x2, #8]
4. Optimize redundant mov
I noticed mov
instruction pattern, where a mov
instruction was moving value from reg1
to reg2
and the next mov
instruction was moving from reg2
to reg1
. The second mov
instruction is unnecessary and can be removed. Details in this issue. We now optimize this pattern in .NET 5.
The following code
1
2
mov x20, x2
mov x2, x20
can be optimized to
1
mov x20, x2
Note: RyuJIT already optimizes and remove mov
done between same registers i.e. mov x0, x0
.
5. Optimize and remove redundant load/store
Another pattern that I saw was loading a value from memory location into a register and then storing that value back from the register into same memory location. The second instruction is redundant and can be removed. Likewise, if there is a store followed by a load. Details in this and this issue. We are in process of addressing this optimization.
The following code
1
2
ldr w0, [x19, #64]
str w0, [x19, #64]
can be optimized to
1
ldr w0, [x19, #64]
6. Optimize memory loads with mov
RyuJIT rarely generates code that will load two registers from same memory location. The second load instruction can be converted to mov
instruction which is cheaper and does not need memory access. Details in this issue.
The following code
1
2
ldr w1, [fp,#28]
ldr w0, [fp,#28]
can be optimized to
1
2
ldr w1, [fp,#28]
mov w0, w1
Conclusion
Peephole optimization are not just limited to these optimizations but can be broadened to achieve much more optimal code. They look very simple, but can have tremendous result on the generated code. Another important lesson I leared was that by writing a simple utility tool in small timeframe, I was able to get concrete data points to justify if such optimizations are worth implementing in RyuJIT. An alternate was either to spend man-months implementing this optimizations and measure the wins or ignore them, both of which were risky.
Namaste!